1.6. Support Vector Classifier (SVC)
1.7. Support Vector Machine (SVM)
1.8. Exercises
TODO
We can't always perfectly separate the data with a $K - 1$ dimensional hyperplane. To overcome this problem we could either relax the requirement that the hyperplane perfectly separates the classes, or enlarge the feature space so that a separating hyperplane exists.
SVCs are a generalisation and extension of the maximal margin classifier, allowing it to be applied to a broader range of cases1.
In practice they are more robust to individual observations and classify most training observations better than the maximal margin classifier. They take the approach that it is better to misclassify a few training examples in order to do a better job classifying the rest.
This is called a soft margin, as it allows the constraints to be violated by a small subset of training observations, which may fall not only on the wrong side of the margin but even on the wrong side of the hyperplane.
We want to relax the constraints
$x_i \cdot w + b \geq +1 \quad \text{for} \ y_i = +1$,
$x_i \cdot w + b \leq -1 \quad \text{for} \ y_i = -1$,
when necessary.
This can be done by first introducing positive slack variables $\xi_i, i = 1, \ldots , l$ into the constraints5, which then become6:
$x_i \cdot w + b \geq +1 - \xi_i \quad \text{for} \ y_i = +1$,
$x_i \cdot w + b \leq -1 + \xi_i \quad \text{for} \ y_i = -1$,
$\xi_i \geq 0 \quad \forall_i$.
To ensure there is a penalty, $C$, for relaxing the constraints, we change the objective function to be minimised from $\frac{||w||^2}{2}$ to
$\frac{||w||^2}{2}+C(\sum_i\xi_i)^k$.
If we set $k=1$, neither the $\xi_i$ nor their Lagrange multipliers appear in the Wolfe dual problem. This means we now have6:
$L_D \equiv \sum_i\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_jy_iy_jx_i\cdot x_j \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad \sum_i\alpha_iy_i = 0$.
This also has the same solution as before:
$w = \sum^{N_S}_{i=1}\alpha_iy_ix_i$.
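We can check this relationship numerically. In a sketch like the one below (using illustrative blob data, not an example from the text), scikit-learn's `SVC` stores the products $\alpha_i y_i$ in `dual_coef_` and the support vectors in `support_vectors_`, so $w$ can be recovered by hand and compared with the fitted `coef_`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs (illustrative data only)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Recover w = sum_i alpha_i y_i x_i from the support vectors.
# sklearn stores alpha_i * y_i in dual_coef_ and the s_i in support_vectors_.
w_manual = clf.dual_coef_ @ clf.support_vectors_

print(np.allclose(w_manual, clf.coef_))  # True: both give the same w
```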
$C$ is a tuning parameter that controls the bias-variance trade-off.
When $C$ is large, violations of the margin are heavily penalised, so the margin is narrow and rarely violated and the classifier is highly fit to the training data (low bias-high variance). Conversely, when $C$ is small, the margin is wider and more violations are tolerated, amounting to less hard fitting (high bias-low variance).
Like most hyper-parameters, it is often chosen using cross-validation.
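As a sketch of that cross-validation step (the dataset and grid of $C$ values here are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Search a logarithmic grid of C values with 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
grid = GridSearchCV(pipe, {"svc__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)

print(grid.best_params_["svc__C"])  # the C with the best CV accuracy
```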
Like maximal margin classifiers, SVCs rely on only a few observations: those on the margin or those that violate it (the support vectors). Observations on the correct side of the margin don't change the classifier, which means SVCs are robust to observations far away from the hyperplane.
TODO
It is common to define $C$ as $C = \frac{1}{\nu N}$, where $0 < \nu \leq 1$ controls the fraction of misclassified points during the training phase7.
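scikit-learn exposes this parameterisation directly as `NuSVC`, where `nu` upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors. A minimal sketch on synthetic data (the dataset and `nu=0.1` are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Synthetic two-class problem (illustrative data only)
X, y = make_classification(n_samples=200, random_state=0)

# nu upper-bounds the fraction of margin errors and
# lower-bounds the fraction of support vectors
clf = NuSVC(nu=0.1, kernel="linear").fit(X, y)

print(len(clf.support_) / len(X))  # fraction of support vectors, >= nu
```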
The SVM aims to address the situation where the boundary between two classes is not linear.
We could consider enlarging the feature space using quadratic, cubic or higher-order polynomial functions; however, the larger the number of features, the greater the computational burden.
Instead it is common to enlarge the feature space using an extension of a SVC termed a Support Vector Machine, which uses kernels.
Polynomial Feature Engineering: 1.46 ms ± 35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Polynomial Kernel: 1.1 ms ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
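A comparison along these lines can be sketched as follows: fit one SVC on explicitly expanded polynomial features, and one using the polynomial kernel, which never materialises the expansion (the dataset sizes and degree here are illustrative, and sklearn's `poly` kernel $(\gamma\, x_i \cdot x_j + r)^d$ is only roughly equivalent to a plain degree-2 expansion):

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Explicit feature engineering: materialise all degree-2 monomials,
# then fit a linear-kernel SVC in the enlarged space
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
explicit = SVC(kernel="linear", C=1.0).fit(X_poly, y)

# Kernel trick: the expansion is never materialised
kernelised = SVC(kernel="poly", degree=2, C=1.0).fit(X, y)

print(X_poly.shape[1], X.shape[1])  # 20 features materialised vs 5 kept
```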
The kernel trick relies on the fact that we can define our SVM in terms of dot products, $x_i\cdot x_j$.
Imagine we had a mapping $\Phi$ which maps the data to some high dimensional Euclidean space
$\Phi: \mathbb{R}^d \mapsto H$,
then we would still be using dot products in $H$. However we could instead use a kernel function
$K(x_i,x_j) = \Phi(x_i)\cdot \Phi(x_j)$,
meaning we never need to compute $\Phi$ explicitly, saving us computation.
We still have all the same considerations, but replacing $x_i\cdot x_j$ with $K(x_i, x_j)$ allows us to produce an SVM in an infinite-dimensional space.
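The identity $K(x_i, x_j) = \Phi(x_i)\cdot\Phi(x_j)$ is easy to verify for a small case. For 2-D inputs, the kernel $K(x,z) = (x \cdot z)^2$ corresponds to the explicit map $\Phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$; this particular map is a standard textbook example, not one from the notes:

```python
import numpy as np

# Explicit degree-2 map for 2-D inputs matching K(x, z) = (x . z)^2
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

# The kernel computes the dot product in H without ever forming phi
print(phi(x) @ phi(z), K(x, z))  # both give 121.0
```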
In the test phase, instead of computing the sign of $w \cdot x^* + b$ for a test point $x^*$ directly, we can use the support vectors, $s_i$:
$f(x) = \sum_{i = 1}^{N_S}\alpha_iy_i\Phi(s_i)\cdot \Phi(x) +b = \sum_{i = 1}^{N_S}\alpha_iy_iK(s_i,x) +b$,
avoiding computing $\Phi(x)$.
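This is exactly what scikit-learn does internally; we can reproduce $f(x) = \sum_i \alpha_i y_i K(s_i, x) + b$ by hand from a fitted RBF `SVC` and check it against `decision_function` (the moons dataset and $\gamma = 0.5$ are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_i alpha_i y_i K(s_i, x) + b, using only the support vectors s_i
x_test = np.array([[0.5, 0.0]])
k = rbf_kernel(clf.support_vectors_, x_test, gamma=0.5)  # K(s_i, x*)
f_manual = clf.dual_coef_ @ k + clf.intercept_

print(np.allclose(f_manual, clf.decision_function(x_test)))  # True
```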
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery, 2(2), 121-167.
When an SVM trained using an RBF kernel classifies a test observation $x^* = (x^*_1, \ldots, x^*_p)^T$, only training observations close to $x^*$ (in terms of Euclidean distance) play a role in its class label. This is because for distant training observations the $(x^*_j-x_{ij})^2$ terms will be large, so $\exp(-\gamma\sum^p_{j=1}(x^*_j-x_{ij})^2)$ will be small.
[MORE INFO ON RBF kernels]
[MORE INFO ON HOW C AND GAMMA INTERACT]
TODO
Phi(-1.0, -2) = [0.74081822]
Phi(-1.0, 1) = [0.30119421]
"It is this fact that allows us to construct hyperplanes in these very high dimensional spaces yet still be left with a tractable computation. Thus SVMs circumvent both forms of the “curse of dimensionality”: the proliferation of parameters causing intractable complexity, and the proliferation of parameters causing overfitting." Burges (1998)
"Perhaps the biggest limitation of the support vector approach lies in choice of the kernel. Once the kernel is fixed, SVM classifiers have only one user-chosen parameter (the error penalty), but the kernel is a very big rug under which to sweep parameters. Some work has been done on limiting kernels using prior knowledge (Scholkopf et al., 1998a; Burges, 1998), but the best choice of kernel for a given problem is still a research issue." Burges (1998)
"A second limitation is speed and size, both in training and testing. While the speed problem in test phase is largely solved in (Burges, 1996), this still requires two training passes. Training for very large datasets (millions of support vectors) is an unsolved problem" Burges (1998)
"Discrete data presents another problem, although with suitable rescaling excellent results have nevertheless been obtained (Joachims, 1997)." Burges (1998)
Now might be a good time to try exercises X-X.
web1. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
web2. https://scikit-learn.org/stable/datasets/toy_dataset.html